Control of false discoveries in grouped hypothesis testing for eQTL data

Background Expression quantitative trait locus (eQTL) analysis aims to detect the genetic variants that influence the expression of one or more genes. Gene-level eQTL testing forms a natural grouped-hypothesis testing strategy with clear biological importance. Methods to control family-wise error rate or false discovery rate for group testing have been proposed earlier, but may not be powerful or easily apply to eQTL data, for which certain structured alternatives may be defensible and may enable the researcher to avoid overly conservative approaches. Results In an empirical Bayesian setting, we propose a new method to control the false discovery rate (FDR) for grouped hypotheses. Here, each gene forms a group, with SNPs annotated to the gene corresponding to individual hypotheses. The heterogeneity of effect sizes in different groups is considered by the introduction of a random effects component. Our method, entitled Random Effects model and testing procedure for Group-level FDR control (REG-FDR), assumes a model for alternative hypotheses for the eQTL data and controls the FDR by adaptive thresholding. As a convenient alternate approach, we also propose Z-REG-FDR, an approximate version of REG-FDR, that uses only Z-statistics of association between genotype and expression for each gene-SNP pair. The performance of Z-REG-FDR is evaluated using both simulated and real data. Simulations demonstrate that Z-REG-FDR performs similarly to REG-FDR, but with much improved computational speed. Conclusion Our results demonstrate that the Z-REG-FDR method performs favorably compared to other methods in terms of statistical power and control of FDR. It can be of great practical use for grouped hypothesis testing for eQTL analysis or similar problems in statistical genomics due to its fast computation and ability to be fit using only summary data. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-024-05736-3.


Background
Expression quantitative trait locus (eQTL) analysis aims to detect genetic loci that are associated with the expression of one or more genes [1].For each gene, expression can be considered as a quantitative trait potentially associated with the genotypes at different sites in the genome, typically single nucleotide polymorphisms (SNPs) [2].Although there is a substantial literature on both eQTL mapping [3][4][5] and grouped hypothesis testing [6][7][8], consideration of the natural gene-level grouping of the SNPs, e.g., SNPs local to a gene for a cis-eQTL problem, is comparatively unexplored or requires permutation methods or approximations [9,10].Analysis of gene-level eQTLs and meaningful consideration of causal SNPs is an important biological problem [11].Testing whether there is any eQTL (local SNP) for an entire gene while controlling the false discovery rate (FDR) across the set of all genes may be interesting for various reasons, which has been imperfectly addressed in the "e-Gene" concept employed by the GTEx Consortium [12].
Local (cis) eQTL testing includes tests of individual SNPs nearby a gene, which leads to summaries at the gene level [12].The natural hierarchical organization would suggest standard methods for group-level testing [6,13].However, local eQTL testing can include additional structure to be exploited: (i) the number of cis-eQTLs is typically large, so that explicit consideration of the proportion and "strength" of alternatives is possible; (ii) the conditional analyses of discovered eQTLs suggest that, to a first approximation, most local eQTLs can be considered unique within the region [14]; (iii) correlation of test statistics is driven by regional SNP correlation.
In the following sections, we discuss the structure of eQTL data and how the grouped nature can be effectively modeled using a random effects model.We consider the case of cis-eQTLs, i.e. local to the gene [14,15], where the variant affecting the gene expression is in the immediate neighborhood of the gene.Our proposed method, entitled Random Effects model and testing procedure for Group-level FDR control (REG-FDR), uses an empirical Bayes framework to model the eQTL data and controls the FDR by adaptive thresholding.We also propose an alternate approach Z-REG-FDR, an approximate version of REG-FDR, that uses only the summary measures given by the Z-statistics of association between genotype and expression for each gene-SNP pair.We demonstrate using simulations and real data analysis that this approximate version performs well compared to other possible approaches while having a much faster computation time.

Structure of the eQTL data and the hypotheses
eQTL data can typically be expressed in the form of an expression matrix, consisting of N genes along with a genotype matrix which has genotypes (m SNPs) for the same n sample units.We denote the expression matrix as Y N ×n and the genotype subma- trix corresponding to the ith gene as X (i) m i ×n , i = 1, 2, . . ., N , where m i is the number of SNPs local to the ith gene.Linear modeling of eQTLs typically includes additional covariates, such as expression cofactors [12,16].The t-statistics for the partial correlations between Y and X (i) m i ×n , after both are adjusted for covariates, are equivalent to the Wald statistics for the X (i) m i ×n when conducting the full linear regression in which Y is modeled as a function of X (i) m i ×n and the additional covariates [17,18].We assume that the sample size n is large enough that the residual degrees of freedom for the t statistic is sufficient to use a standard normal approximation.Thus we can can directly work with z-statistics for Y and X (i) m i ×n , where we consider each of these matrices to have been covariate-residualized.
Let H 0ij denote the gene-SNP level null hypothesis that there is no eQTL at the jth SNP local to the ith gene, j = 1, 2, . . ., m i , i = 1, 2, . . ., N .Therefore there are N i=1 m i gene-SNP level tests.These tests can be grouped into N groups corresponding to the N genes with m i tests in the ith group.Define H 0i to be the gene-level null hypothesis for the ith gene that there is no eQTL for the ith gene.Therefore the gene-level null hypothesis H 0i can be written as i.e. the gene-level null requires that all of the corresponding m i gene-SNP level hypoth- eses be null.

An empirical Bayes model
We adopt an empirical Bayes approach for controlling the gene-level FDR.Empirical Bayes approaches have been used in many genetic applications, and indeed these applications have been a prime motivator for the methods [19,20].The advantages of using an empirical Bayes approach based on the local false discovery rate (lfdr), instead of p-value-based FDR-controlling approaches, has been discussed in [21] and [22].The lfdr corresponding to the gene-level null hypothesis H 0i is Here Y i denotes the ith row of the matrix Y.If we obtain the lfdr i for each of the gene- level hypotheses, we can control the FDR at target level α for gene-level testing, using the following adaptive thresholding procedure, which has been used extensively in the literature [7,[23][24][25].
Reject hypotheses H 0i 1 , . . ., H 0i L where L is the largest integer such that [24] and subsequently [7] showed that the adaptive thresholding procedure is valid in the sense that it controls the FDR at target level α for an 'oracle' procedure where the true parameters of the model are assumed to be known.It is asymptotically valid for a 'data-driven' procedure when the parameters are consistently estimated from the data.[25] proved its validity under further relaxed conditions.The proof makes use of the following result (Averaging Theorem, [19]).
Let lfdr(z) = P(H 0 |z) denote the lfdr for observed data z.Then, for a rejection region R , the FDR will be given by The adaptive thresholding procedure can be used to control the FDR for testing the gene-level hypotheses H 0i 's and a similar procedure can be used to test the gene-SNP (1) level hypotheses H 0ij 's.However, obtaining the gene-level lfdr's is a non-trivial problem.In the next section, we propose a model which enables us to calculate the lfdr's.

The random effects model and testing procedure for group-level FDR control (REG-FDR)
Here we propose a model to obtain the gene-level lfdr values, that can be subsequently used to test the gene-level hypotheses while controlling the FDR using the adaptive thresholding method.The model is based on the following assumptions.

A1. For any gene i, under the gene-level alternative hypothesis H c
0i , there exists a single causal SNP that influences its expression.A2.Each of the m i SNPs has equal probability to be the causal SNP.
First, we note that Assumption (A1) is at best a simplification, but very large eQTL studies have supported the view that most genes with eQTLs have a primary local eQTL [26], with other loci having much smaller effect sizes.We therefore treat A1 as a 'workable condition' [27][28][29].
Assumption (A2) could easily be relaxed, and one might use a distributional assumption over the SNPs as a modest modification of our method below (see the Discussion section).We note that it is trivial to enforce Assumption (A2) by, for example, randomizing the SNP identities within gene i prior to analysis.
Under these assumptions, the gene-level lfdr for the ith gene has the following form: where is the conditional density under the alternative given that the jth SNP is causal.Here β ij is correlation between the expression of the ith gene and the causal SNP j.Note that the marginal density p(X (i) ) cancels from numerator and denominator.Importantly, this cancellation allows us to bypass the modeling of the dependence structure of the SNPs, which otherwise might have been difficult to estimate accurately.
We assume that f 0 (.) is the density of the N n (0, I n ) distribution (noting that expression data can be normalized), and that f 1 (. j .This choice of f 1 ensures that the unconditional variance of Y i is free of β ij .To account for variability across genes, we assume β ij to be a random effect such that √ n − 3 tanh −1 (β ij ) follows a N (0, σ 2 ) dis- tribution.As β ij is a correlation coefficient, the Fisher transformation is used to ensure that the variance does not depend on the mean.Moreover, σ will be estimated from the data, and so the apparent dependence on n is not important to the procedure.
Our procedure treats the genotype values as fixed, and assuming the expression of genes to be independent, given genotypes, we can estimate π 0 and σ using a maxi- mum likelihood approach and follow with plug-in estimates to obtain estimates The assumption that the expression of different genes are independent is violated in general, but our approach can be viewed as employing a composite likelihood [30], and thus consistent for π 0 and σ even under independence violations [31].An EM algorithm is used (see Additional file 1: Section 1) for the maximum likelihood estimation.The procedure enables us to use the adaptive thresholding procedure to provide proper gene-level control of the FDR.

The Z-REG-FDR model
One computational challenge presented by the REG-FDR model is that the density does not have a closed form expression.While it can be expressed as the following integral numeric maximum likelihood estimation is computationally burdensome.We propose an alternative model, termed Z-REG-FDR, which avoids this problem.In this approach, we consider the Fisher transformed and scaled z-statistics as our data.Thus, for each gene i, we have a vector of z-statistics ) and r (i) j is the sample correlation of Y i and X (i) j .Fisher transformation and scaling ensures that z (i) is approximately normal and that the variance of each component is approximately 1 under both null and alternative.Under the null, the mean of z (i) is zero.We treat the component z (i) as if they are inde- pendent across different genes, again relying on approximate conditional independence (given genotypes) and a compositie likelihood interpretation.
The Z-REG-FDR procedure is based on an additional assumption to (A1) and (A2) above.If the kth SNP is causal, we assume (Assumption (A3)) that the distribution of (z k under the alternative is same as that under the null.In particular, we note that this assumption is true if the components of z (i) have a Markov dependence structure with the same serial correlation under null and alternative, which is true in the special case that the successive marker correlations are zero.In general, this assumption can be violated, but as shown in "Simulations: performance of Z-REG-FDR as an approximate maximum likelihood estimation" section, the resultant procedure appears to work well in many circumstances as an approximate maximum likelihood method even when Assumption (A3) is not satisfied.
Under the above assumptions, we can write the joint distribution of the random vector z under the null, and under the alternative.We assume p 0 (.) to be N(0, 1) and p 1 (.) to be N (µ, 1) , where µ is assumed to be random with a N (0, σ 2 ) distribution.We do not assume anything about the form of f 0|k except that it does not involve the parameters π 0 and σ .Under these assumptions, the gene-level lfdr for this model reduces to This follows from the cancellation of ) in the numerator and denominator.While estimating π 0 and σ , a similar cancellation helps us bypass max- imizing the full (approximate) likelihood Instead, we maximize This is equivalent to the maximum likelihood estimation under the assumption that f 0|k does not involve the parameters π 0 and σ .Note that we need to estimate only the param- eters π 0 and σ to obtain the gene-level lfdr using Eq. 9.
When the required assumptions are not satisfied, this method still has value as an approximate maximum likelihood approach.For instance, when the X (i) j 's are related by an AR(1) structure, it can be shown that the correlation between the z-statistics depends on the effect size, i.e. the correlation between Y i and the causal SNP, hence violating Assumption (A3).Additional file 1: Lemma 1 and Additional file 1: Figures 1 and 2 show the extent to which the conditional distribution f 0|k might depend on the effect size for any correlation structure among normally distributed SNPs.However, our results in "Simulations: performance of Z-REG-FDR as an approximate maximum likelihood estimation" section demonstrate that it does not have a significant adverse effect on the performance of the estimation and control of false discovery.

Simulations: performance of Z-REG-FDR when all assumptions are satisfied
First, we conducted a simulation study to explore the performance of Z-REG-FDR under the ideal situation where all assumptions are satisfied.Table 1 shows the results for simulated datasets (1000 simulations of datasets with 10,000 genes and 200 samples) where z's are directly simulated from an autoregressive structure, and therefore Assumption (A3) is also satisfied.The estimates are accurate to within about 15% when the true σ is at least 2.0.The control of the FDR is also satisfactory for σ > 2 .However, the performance is not as good for small σ , which is due to the fact that it is difficult to separate the null and alternative cases when the effect sizes ( 9) are small; this is true even when all the assumptions are satisfied.This is a property of the two group mixture model in the empirical Bayes set up, and not a limitation due to the approximate nature of Z-REG-FDR.

Simulations: performance of Z-REG-FDR as an approximate maximum likelihood estimation
We wished to study the accuracy of the estimation under the approximations employed and for a relatively small sample size, in order to ensure that the approach can work in this challenging situation.Accordingly, we simulated data that uses the covariate adjusted genotype matrix of a real dataset from the GTEx project (V3) [12].The genotype matrix corresponding to the tissue 'heart' , which had 83 samples, was selected for analysis.For computational purposes, 10,000 genes were chosen randomly from 28,991 genes.Use of genotype matrices from real data ensures that we are not enforcing Assumption (A3) while simulating, and our choice of f 0|k for the simulation is obtained from the data.We simulate the Y i 's (1,000 simulations) using the following scheme.
1.For each gene, decide whether it has an eQTL using a Bernoulli(π 0 ) distribution.2. If the gene has an eQTL, pick a causal SNP using a discrete uniform distribution over the m i SNPs.Let it be the kth SNP.

If the gene has an eQTL, simulate each element of
If the gene doesn't have an eQTL, simulate each element of Y i from N(0, 1).
Table 2 shows the results for this data, indicating that the estimates are still accurate and control of FDR is satisfactory unless σ is very small.Large eQTL studies have observed large effect sizes for cis-eQTL analysis [15,32] which implies that σ is not expected to be very small.Thus our numerical results indicate that the Z-REG-FDR method has valid applications for eQTL data.
Figure 1 shows the plot of REG-FDR estimates against the Z-REG-FDR estimates for 500 simulated datasets using the simulation scheme described above.It is clear from the plot that the two methods agree with each other (with correlations 0.906 and 0.952 for π 0 and σ , respectively) and largely fall near the unit line.These results suggest that the approximate maximum likelihood method in Z-REG-FDR is quite effective in controlling the FDR, with a much improved computation speed-a few minutes on a single computer for a dataset with 10,000 genes and 100-200 samples  as opposed to more than a day for REG-FDR.A comparison of the estimated lfdr and estimated FDR of the two methods is shown in Fig. 2. It is evident that the slight overestimation of π 0 and the slight underestimation of σ by Z-REG-FDR work in opposite directions, which leads to similar lfdr values when compared to REG-FDR.The correlation between the estimated FDR based on the true values of the parameters and that based on REG-FDR or Z-REG-FDR are also very high (see Additional file 1: Figure 3).

Behavior of the expected pseudo-log-likelihood of Z-REG-FDR
It is a standard result that the expected log-likelihood is maximized at the true value of the parameter under standard regularity conditions [33].Since REG-FDR is the true maximum likelihood method for the proposed model, it is expected to satisfy this property.If Assumption (A3) is not satisfied then Z-REG-FDR is an approximate maximum likelihood method, and as such, its pseudo-log-likelihood need not be maximized at the true value of the parameter.We explored several realistic combinations of the true parameters and observed that the pseudo-log-likelihood of Z-REG-FDR is maximized very near the true parameter value.It is a difficult task to analytically compute the expected pseudo-log-likelihood, and so Monte-Carlo integration was used for this task.
Figure 3 shows the expected pseudo-log-likelihood surface of Z-REG-FDR for π 0 = 0.2 and σ = 3 .A contour plot also confirms the fact the surface peaks near the true values of the parameters.

Simulations: comparison of Z-REG-FDR with other methods
It is possible to use other methodologies to control the FDR in grouped hypothesis testing problem for eQTL data.A conservative approach is to obtain the Bonferroni adjusted p-values for each gene, where the p-value for each gene-SNP pair is computed based on the usual t-test or z-test, and then use an FDR controlling approach [eg 34, Fig. 3 Demonstration of the optimization of log-likelihood properties using Z-REG-FDR method.A. Surface plot and B. Contour plot of expected pseudo-log-likelihood surface for the Z-REG-FDR method.True π 0 and σ are 0.2 and 3 respectively 35,36] to assess the conservative p-values.[29] used a permutation based ("eGene") approach in their analysis of the GTEx data.The method uses the smallest gene-SNP p-value for a gene as the test statistic and computes its distribution by permuting the expression values.Such a distribution can be used to obtain p-values for each gene, which can subsequently be used to control the FDR by using methods such as Storey's q-value method [35].The Bonferroni method is typically conservative and hence less powerful.The permutation method, while correctly controlling false positives, can suffer from lack of power to detect genes having an eQTL since it uses an extreme value statistic (not based on likelihood).Our model, on the other hand, utilizes more information through its use of approximate likelihood.We carried out a simulation study to compare the performance of the methods in terms of their power.The simulations were performed using the simulation scheme described in "Simulations: performance of Z-REG-FDR as an approximate maximum likelihood estimation" section and statistical power was obtained using an FDR threshold of 0.05.The results are shown in Fig. 4. As expected, the Bonferroni method turned out to have very low power and is not shown in Fig. 4. The permutation approach with Storey's q-value method [35] was conservative and less powerful in comparison with Z-REG-FDR.To address the possible concern that Z-REG-FDR can be slightly anti-conservative, and therefore the comparison with the permutation method is unfair, we also included an adjusted version of the Z-REG-FDR method where a slightly lower FDR threshold was chosen based on the simulations in such a way that the estimated FDR was exactly 0.05.This adjusted version had slightly less power compared to unadjusted Z-REG-FDR, but was more powerful than the permutation method.

Analysis of real data
Finally, we also applied the Z-REG-FDR on a real dataset obtained from GTEx (V6) [12].Besides Z-REG-FDR, we also used the permutation method and Simes method [37], which is expected to be more powerful than the Bonferroni method although it may not control the FDR for all types of correlation structures.We applied each method on the GTEx data for 44 tissues, separately for each tissue.
For each tissue, the normalized gene expression data and SNP genotype data were separately residualized after adjusting for covariates provided by GTEx.We fit a linear regression model with individuals' gene expression or SNP genotype as the response variable and covariates as the explanatory variables.Then we extracted the model residuals to obtain "covariate-corrected" gene expression and SNP genotypes.Figure 5 shows a comparison of the number of significant genes found by Z-REG-FDR and the permutation method employed by [12].A complete list of the sample sizes and the number of significant genes discovered for the 44 tissues is provided in Additional file 1: Table 2.The methods agree with each other to some extent in terms of number of discoveries.The Z-REG-FDR method has higher number of discoveries compared to the Permutation method and the Simes method in most cases.The parameter estimates for each tissue using Z-REG-FDR are shown in Fig. 6.

Discussion
We have introduced a principled procedure to perform gene-level FDR test, most appropriate and useful in the eQTL setting.A major advantage of Z-REG-FDR is its computational efficiency.While other methods such as the permutation method or our REG-FDR method can take days on a single PC to complete the analysis of a real eQTL dataset, Z-REG-FDR can do the same in a few minutes.For instance, it takes approximately two minutes to fit the model and find significant genes by Z-REG-FDR for a data set with 4.5 million SNPs grouped as local SNPs for 10, 000 genes.REG-FDR takes about a day, and the permutation method (for 10, 000 permutations) takes about 6 hours to analyze the same data.Since there are thousands of simultaneous tests, even 10, 000 permutations may not be enough to provide sufficient p-value resolution.While the Bonferroni method is very fast, it has little power to detect the genes having true eQTLs.
Z-REG-FDR has additional advantages.One important feature of the method is that it does not require access to the full data.In fact, the symmetry of the distributions involved in the Z-REG-FDR pseudo-likelihood ensure that only the gene-SNP level p-values (or equivalently the absolute z-values) are needed to fit the model.Z-REG-FDR does not model the correlation structure of the SNPs, and therefore does not require access to that data.This might be very useful since, in many genetic applications, data are found in the form of summary measures.
Z-REG-FDR can be slightly anti-conservative depending on the true values of the parameters.Various simulations show that if σ is large, which appears to often be the case for eQTL data, the control of FDR is satisfactory.The fact that Assumption (A3) is not satisfied does not significantly affect the FDR control.Therefore the assumption can be thought of as a means to reduce computational burden, rather than a necessary assumption for the practical workability of the model.
Assumptions (A1) and (A2) also have the potential to be relaxed, although we consider that to be beyond the scope of this paper.For example, the method can be extended by relaxing Assumption (A2) and incorporating a non-uniform prior for the causal location.If a well-grounded prior exists, then it can be incorporated into our method in a straightforward manner using weighted versions of our statistics.We have included an example in the Additional file 1 to demonstrate empirical evidence that the method remains valid even for more than one causal SNPs under certain conditions.
Our use of the lfdr statistics, while valid, does not utilize gene-level local correlation structures [38][39][40][41] that might provide additional power.Implementation of such methods would require sensitive estimation of gene-level correlations, and a possible direction of future effort.
With the continuous increase in the size of genomic data sets, and with the possibility of further extensions of our approach, we strongly believe that the approximate likelihood approach of the Z-REG-FDR method can be of great practical use for grouped hypothesis testing for eQTL analysis or similar problems in statistical genomics.

Fig. 1 Fig. 2 A
Fig.1Comparison of the parameter estimates using REG-FDR and Z-REG-FDR.Except a small number of cases, the two estimates agree with each other.The blue lines show the true values of the parameters

Fig. 4
Fig. 4 Power curves of different methods for varying combinations of the true parameter values

Fig. 5 Fig. 6
Fig. 5 Comparison of Z-REG-FDR and the permutation method for GTEx data

Table 1
Summary of the simulation studies with directly simulated z from an AR(1) model with correlation ρThe last two columns show the FDR control performance of the Z-REG-FDR method when the target FDR is 5% and 10%, respectively

Table 2
Summary of the simulation studies using the SNP matrix from real data.The last two columns show the FDR control performance of the Z-REG-FDR method when the target FDR is 5% and 10%, respectively